Visualization Final Project

Project:

The Lord of the Rings Trilogy Analysis

Authors: (Surname, Name: Matriculation, No.)

Nguyen, Thanh: 14622459;

Tran Ortega, Daniel: 17518031;

Sun, Qumeng: 24821883;

Date:

July, 2022

Pre-processing

Overview

The Lord of the Rings is an epic literary work by Oxford University professor and linguist J.R.R. Tolkien. It was published in three volumes and later adapted into a film series directed by Peter Jackson. The film trilogy has made many more people aware of The Lord of the Rings. One of the most influential works of fantasy literature of the 20th century, The Lord of the Rings has sold over 150 million copies in various editions.

This essay takes The Lord of the Rings as its theme and creates charts on various aspects of this literary work. It is divided into three parts.

First, we show the trilogy's extraordinary achievements as a film series and compare its standing in the commercial market and in the professional field. After that, we perform sentiment analysis on the lines of the films to find the most positive and the most negative characters in the trilogy. Finally, we visualize demographic data from the Lord of the Rings universe, showing the proportions of the races in the world and the regions in which they live.

Part A - The Lord of the Rings Film Trilogy

I. Motivation

The Lord of the Rings Trilogy (2001 - 2003) has long been considered by critics and fans all over the world as one of the greatest, if not the greatest, epic fantasy adventures in cinema history. It is a series of 3 box office hits and critically acclaimed movies directed by Peter Jackson.

20 years have passed since the first installment premiered, yet the trilogy's influence on the film industry, the literary sphere and, naturally, general pop culture is still prominent. For instance, Amazon is going to release a new series set in the Middle Earth universe on its streaming platform this September. The series's total budget is reported at a record-breaking 1 billion USD, making it the most expensive television series ever made.

Thus, considering its lasting relevance, we would like to apply what we have learned throughout this course to analyze the success of the trilogy and to explore many fascinating aspects of Middle Earth with the powerful tool of Visualization.

II. Research Question

Although all 3 films are beloved by fans and critics alike, there is always the question of which one is the best. Or, to put it in the language of Middle Earth, which film will rule them all?

In this part of our project, we attempt to join the debate and provide our own conclusion through data analysis. Therefore, the research question is:

III. Literature review

Considering the combined impact of the original books by the late Professor J. R. R. Tolkien as well as the subsequent adaptations in the cinematic, gaming, musical and many other industries, it goes without saying that there have been numerous studies on aspects related to The Lord of the Rings.

From our limited exploration of scientific publications in this regard, we notice that the prominent topics include storytelling, linguistics, philosophy and even tourism, for example, Shefrin (2014), Bahmani (2021), Kreeft (2005) and Li (2017), respectively. As for the question of our research interest, one would most likely find it in documentaries, individual video essays, passion projects or online forums.

Thus, we approach this part essentially as a passion project while trying our best to apply as many scientific methods as possible.

IV. Data

For this part, our primary source of information is Box Office Mojo, where we collected numbers such as the movies' budgets, box office revenues and theatrical release periods. Afterward, we visited 3 of the most popular film review websites, namely Rotten Tomatoes, IMDB and Metacritic, to extract their respective scores for the trilogy.

V. Methodology and Result

This section describes the methods, which are mainly descriptive data analysis, together with the reasoning we employed, and illustrates our findings with visualizations.

Firstly, we would argue that the measurement of a film's success should not rest only on financial success, but also has to take into account the audience's reception and the merits the film receives. With this reasoning, we come up with 3 criteria and subset our data accordingly:

1. Production value

We define the production value of a film as the investment return to its producers through box office revenues together with the dominance the film exerts in theaters during its first release.

Let us take an overall view of how much profit the trilogy generated. To do so, we use a typical financial measurement, ROI (Return On Investment), obtained by dividing the profit earned on an investment by the cost of that investment. The results show that the last installment is by far the most profitable, followed by its predecessors in reverse order. For every USD spent, The Return of the King earned 12 USD back for its investors.

Subsequently, we calculate the Box Office Performance Score by the ROI share of each film relative to the other 2 in the trilogy.

The second measurement, Box Office Dominance, is the ratio of the days that a film was the box office number 1 over its entire theatrical release. International data is not sufficient for the whole trilogy (for example, we have data for Brazil for the first film but not for the second one), and it can also be biased because the total number of days in theaters is not evenly distributed across countries or across the movies themselves. We therefore decided to use only the "domestic data", by which we mean the first release period (the trilogy has had many, even until now) in the United States only.

The Box Office Dominance Score is calculated with the same logic as above, that is, as the share of each film in the trilogy.

Finally, we average out the Box Office Performance Score and the Box Office Dominance Score to obtain the Production Value Score.
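The scoring logic described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names are hypothetical, and the budgets, grosses and day counts are placeholder figures, not the values we collected from Box Office Mojo.

```python
def roi(gross, budget):
    """Return on investment: profit divided by cost."""
    return (gross - budget) / budget


def shares(values):
    """Share of each film relative to the trilogy total (shares sum to 1)."""
    total = sum(values.values())
    return {film: value / total for film, value in values.items()}


# Placeholder figures (million USD and days); NOT the project's data.
films = {
    "Film A": {"budget": 100, "gross": 500, "days_no1": 20, "days_total": 200},
    "Film B": {"budget": 100, "gross": 700, "days_no1": 30, "days_total": 200},
    "Film C": {"budget": 100, "gross": 900, "days_no1": 50, "days_total": 200},
}

# Box Office Performance Score: each film's share of the summed ROI.
performance = shares({f: roi(d["gross"], d["budget"]) for f, d in films.items()})

# Box Office Dominance Score: share of the "days at number 1" ratio.
dominance = shares({f: d["days_no1"] / d["days_total"] for f, d in films.items()})

# Production Value Score: plain average of the two share-based scores.
production_value = {f: (performance[f] + dominance[f]) / 2 for f in films}
```

Because both inputs are shares that sum to 1, the averaged Production Value Scores also sum to 1, which keeps the three criteria comparable later on.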

In the stacked group bar chart below, we can see that although the 3 films had basically the same budgets at 93-94 million USD, The Return of the King was significantly more commercially successful, being the second film ever to break the 1 billion USD mark after James Cameron's Titanic (1997). That said, these budgets do not really reflect the real production costs, as all 3 films were shot simultaneously and were heavily supported by locals, especially the New Zealand government, which wanted to promote its own tourism.

As for Box Office Dominance, all 3 films were well received, as shown by both their total days as number 1 at the box office and their extensive release periods of roughly 240 days each, which are exceptionally long compared to the average (about 1-3 months in theaters). However, The Return of the King has the best stats here, as it had the longest period as number 1 despite the shortest run among the 3: 33/240 days. The Two Towers, on the contrary, has the weakest ratio: 28/244.

-> Verdict: overall, we have a dominant winner of this category, which is the last film, followed by the second one, which is only marginally ahead of the first film.

2. Accolade

Accolade is the measurement of the merits that the movies received. Here we simplified our variables by considering only the Oscars and the other awards, each classified as either actually winning or just being nominated.

Considering that a win speaks louder than a nomination, we put twice as much weight on every win. In other words, 1 win = 2 nominations. Summing them up with these weights, we obtain 2 new variables, namely OscarWeight and OtherWeight. From there, the same logic applies: ScoreOscar and ScoreOther are the shares of each film in the trilogy, and ScoreAccolade is the average of the two.
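The weighting can be illustrated with the short sketch below. The award counts are placeholders rather than the real figures, the helper names are hypothetical, and we assume here that nominations are counted separately from wins.

```python
WIN_WEIGHT = 2         # 1 win counts as much as 2 nominations
NOMINATION_WEIGHT = 1


def weighted_count(wins, nominations):
    """Weighted sum of wins and (non-winning) nominations."""
    return WIN_WEIGHT * wins + NOMINATION_WEIGHT * nominations


def shares(values):
    """Share of each film relative to the trilogy total."""
    total = sum(values.values())
    return {film: value / total for film, value in values.items()}


# Placeholder counts; NOT the project's data.
awards = {
    "Film A": {"oscar_wins": 2, "oscar_noms": 6, "other_wins": 5, "other_noms": 10},
    "Film B": {"oscar_wins": 1, "oscar_noms": 3, "other_wins": 10, "other_noms": 10},
    "Film C": {"oscar_wins": 5, "oscar_noms": 0, "other_wins": 10, "other_noms": 5},
}

oscar_weight = {f: weighted_count(a["oscar_wins"], a["oscar_noms"]) for f, a in awards.items()}
other_weight = {f: weighted_count(a["other_wins"], a["other_noms"]) for f, a in awards.items()}

score_oscar = shares(oscar_weight)
score_other = shares(other_weight)

# ScoreAccolade: average of the Oscar share and the other-awards share.
score_accolade = {f: (score_oscar[f] + score_other[f]) / 2 for f in awards}
```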

The diverging bar plot below once again indicates an outstanding feat by The Return of the King, which won 100% of its Oscar nominations, making it the most Oscar-winning film of all time, a record it shares with Titanic (1997) and Ben-Hur (1959). Although it has fewer Oscar wins, The Fellowship of the Ring has more nominations, which is already a feat in itself considering that the most Oscar-nominated film, again Titanic (1997), has only 1 more nomination, at 14.

The Two Towers, on the other hand, did not fare well at the Oscars, at least in comparison to its siblings. However, the second installment was much more successful at other awards, both in terms of nominations and wins.

-> Verdict: overall, it is difficult to identify a dominant winner in this category. In absolute numbers, The Fellowship of the Ring and The Two Towers both have 43 nominations, with 12 and 20 wins, respectively, while The Return of the King won 27 out of only 40 nominations. Thus, in terms of converting opportunities, The Return of the King is a clear winner, on top of its Oscar feat. However, The Two Towers outperformed both its predecessor and its sequel at the other awards. Unfortunately, since its performance at the Oscars was significantly behind the other 2, our metric puts it last in this category. Second place goes to a consistent contestant, The Fellowship of the Ring, while the King remains firmly on his throne.

3. Reception

Last but not least, we look at how the films have been received by the viewers as their legacies can only withstand the test of time as long as there are audiences who still hold them in their hearts.

The audience here comprises critics as well as fans and casual film viewers. Thus we turn to Rotten Tomatoes, where the reviews are given by professional critics; IMDB, where anyone can make an account to vote (or downvote); and Metacritic, which is a combination of both.

Again, we calculate their respective shares in the trilogy and then average them out for the final score.

The connected scatterplot below is clearly in favor of The Return of the King, whose only non-first rank comes from Rotten Tomatoes, which puts it in second place.

The Fellowship of the Ring, once again, stays consistently in second place.

The Two Towers has its ups and downs, taking all 3 positions at some point.

-> Verdict: The Return of the King is again the winner, although only barely, while the other 2 are almost on par with each other, with just a negligible lead for The Fellowship of the Ring.

VI. Final verdict

To sum up, we overlay all 3 movies in a radar chart to compare them across the 7 aforementioned sub-categories.

Visually speaking, The Return of the King either shares first place or simply wipes the floor with the other 2 in all but 1 category, non-Oscar awards, which goes to The Two Towers. The Fellowship of the Ring, in turn, has some edge over its second brother when it comes to the Oscars.

For the final verdict, we condense those 7 sub-categories into just 3 main criteria as described in the beginning, namely Production value, Accolade and Reception.

The Return of the King is slightly better in Reception while significantly beating the other 2 fair and square in both of the remaining criteria. The Two Towers marginally has an edge over The Fellowship of the Ring in terms of Production Value but lacks Accolades to compete for second place.

We then derive the final score by taking the average of all 3 main criteria. Lastly, we convert these scores to a scale of 100 points, which go to the best one, then rank the other 2 accordingly.
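A minimal sketch of this last step is given below. The criterion scores are placeholders, the function name is hypothetical, and we assume the conversion to 100 points is a simple proportional rescaling relative to the best film.

```python
def final_scores(criteria):
    """Average the criteria per film, then rescale so the best film gets 100."""
    averages = {film: sum(scores) / len(scores) for film, scores in criteria.items()}
    best = max(averages.values())
    return {film: 100 * avg / best for film, avg in averages.items()}


# Placeholder (Production value, Accolade, Reception) shares; NOT the project's data.
criteria = {
    "Film A": (0.32, 0.30, 0.33),
    "Film B": (0.28, 0.20, 0.33),
    "Film C": (0.40, 0.50, 0.34),
}

scores = final_scores(criteria)

# Films ordered from best to worst final score.
ranking = sorted(scores, key=scores.get, reverse=True)
```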

--> Final verdict: The Fellowship of the Ring and The Two Towers are a close call for second place, yet due to its relative underperformance at the Oscars, the latter runs out of steam and loses the race.

As for the title of the best film of the trilogy, we have a clear winner: The Return of the King!

Part B - Script Analysis of the Lord of the Rings

This is the second part of this article. In this section, the article will analyse the lines of the Lord of the Rings trilogy. Two bar charts will be drawn to express how many lines each character has, and what proportion of them are in the different films. Afterwards, we will analyse the emotional tendencies in the lines with the help of NLTK, and we take the average of the emotional scores of each line as the emotional score of a character.

Note: NLTK is a leading platform for building Python programs to work with human language data. Its VADER module can be used to rate the sentiment of words or sentences, with positive numbers representing positive sentiment and negative numbers representing negative sentiment. After comparing it with other natural language processing resources such as the "NRC Emotion Lexicon", we chose NLTK's VADER, which can analyse text passages.

The file "script.csv" contains multiple columns with categorical and text data. One additional column is supposedly an index. Overall, the data describes what each character in the Lord of the Rings trilogy said in a given movie.

1.1 Tidying data

A first step before analysing the data is to clean it so that working with it is feasible.

In the following code chunk, multiple functions were created to clean the script. It is a rather tedious endeavour in which punctuation, whitespace, upper/lowercase and other aspects of the language were "cleaned".

Additionally, in the char column we had overlapping characters or simple data recording errors that needed fixing.

After data cleaning, we obtain more usable data. Also, by removing words that have no sentiment value, we can reduce the number of confounding items in the sentiment analysis.
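The kind of cleaning described above might look like the following sketch. The stop-word list and the name corrections are small hypothetical examples, not the lists actually used in the project, and the function names are illustrative.

```python
import re
import string

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}

# Hypothetical fixes for inconsistent speaker names in the `char` column.
CHAR_FIXES = {"gandalf the grey": "gandalf", "frodo baggins": "frodo"}


def clean_line(text):
    """Lower-case, strip punctuation, collapse whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()  # split() also collapses repeated whitespace
    return " ".join(w for w in words if w not in STOP_WORDS)


def clean_char(name):
    """Normalise a speaker name and map known variants to one spelling."""
    name = re.sub(r"\s+", " ", name).strip().lower()
    return CHAR_FIXES.get(name, name)
```

For example, `clean_line("  The Ring is   MINE!! ")` yields `"ring mine"`, and `clean_char(" Gandalf  the Grey ")` yields `"gandalf"`.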

1.2 Further Data Manipulation

For the following visualisations we require the data to be structured differently. By imposing a lower limit on the number of lines, we removed characters with too small a sample of lines. We end up with a grouped and reduced data set with three columns: character, movie and count.

Here we widen the data set. Each movie appears as a column containing the number of dialogue lines spoken per character.
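With pandas, the grouping, filtering and widening steps could be sketched as follows. The toy data, the column names `char` and `movie`, and the threshold value are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for the cleaned script; column names are assumed.
script = pd.DataFrame({
    "char":  ["frodo", "frodo", "frodo", "sam", "sam", "gollum"],
    "movie": ["Film 1", "Film 1", "Film 2", "Film 1", "Film 2", "Film 2"],
})

MIN_LINES = 2  # illustrative threshold, not the project's limit

# Long format: one row per (character, movie) with the number of lines.
counts = script.groupby(["char", "movie"]).size().reset_index(name="count")

# Keep only characters whose total line count reaches the threshold.
totals = counts.groupby("char")["count"].transform("sum")
counts = counts[totals >= MIN_LINES]

# Wide format: one column per movie, missing combinations filled with 0.
wide = (
    counts.pivot(index="char", columns="movie", values="count")
    .fillna(0)
    .astype(int)
)
```

The long table is convenient for a single bar per character, while the wide table maps directly onto stacked bars with one segment per movie.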

1.3 Horizontal Barchart spoken lines per character

Here we visualise how many dialogue lines each character speaks. This is very straightforward.

The second chart follows the same approach, only splitting each bar by movie.

1.4 Sentiment Analysis

Using the same data set as before, we now compute the sentiment of each character as determined by the lines they speak in the trilogy.

1.4.2 Constructing sentiment analysis function

The following code chunk provides a function that computes and returns the sentiment value given by the polarity score of the NLTK library.
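A minimal sketch of such a function is shown below. It assumes that the "compound" entry of VADER's polarity scores is used as the sentiment value, which is our reading rather than something fixed by the text, and the helper names are hypothetical. The analyzer is passed in as an argument so that the scoring logic does not depend on how it is created.

```python
def make_vader():
    """Build NLTK's VADER analyzer (needs nltk.download("vader_lexicon") once)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def line_sentiment(text, analyzer):
    """Compound polarity of one line, a value between -1 and 1."""
    return analyzer.polarity_scores(text)["compound"]


def character_sentiment(lines_by_char, analyzer):
    """Average the per-line compound scores for each character."""
    return {
        char: sum(line_sentiment(line, analyzer) for line in lines) / len(lines)
        for char, lines in lines_by_char.items()
        if lines  # skip characters without any lines
    }
```

A typical call would be `character_sentiment({"sam": ["i am glad to be with you"]}, make_vader())`, which returns a dictionary mapping each character to an average score.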

We compute the sentiment score for each character and manipulate the data such that visualisations are feasible.

1.5 Constructing a Treemap

We construct a treemap that shows the sentiment score by the size of each rectangle and, by its colour, whether the sentiment is positive or negative. It is no surprise that Grima and the combined lines of all orcs have the most negative scores. However, it is a surprising finding that Legolas and Sam are negative as well. Characters such as Theoden, Denethor or Elrond are quite pessimistic, so they are expected to be on the negative side of sentiment. A deeper analysis of what Legolas and Sam say during the movies is necessary to validate the findings. In contrast, the jolly Hobbit Bilbo scores the highest positive sentiment. This is no surprise, given that he was not in any dangerous position and spent most of his lines talking about his birthday and celebration. A very interesting finding is that Gollum and Smeagol are in fact on opposite halves.

1.6 Constructing a bubble chart

We also construct a bubble chart for the sentiment score. It is considerably less readable than the treemap and largely redundant, so it might be removed in the final version.

2 Sentiment and Age

We have obtained an emotional score for each character by examining the emotional disposition of each line. In this section we will average the emotional scores for each word and then try to show how the age of the characters in Lord of the Rings relates to their emotional state.

Our hypothesis is that older people will be calmer and have less emotional turmoil in their lines. In the graph, they will be around the 0 cut-off.

2.1 Pre-Processing

First, we define a function to find the sentiment score. In contrast to the previous method, here we aggregate all the words spoken by the character and then average them.
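One way to express this word-level aggregation is sketched below. The function name is hypothetical, `analyzer` is assumed to be an NLTK `SentimentIntensityAnalyzer` (or any object with a compatible `polarity_scores` method), and using the "compound" value per word is an assumption.

```python
def word_level_sentiment(lines, analyzer):
    """Pool all words of a character and average their individual scores."""
    words = [word for line in lines for word in line.split()]
    if not words:
        return 0.0  # no words spoken: treat as neutral
    scores = [analyzer.polarity_scores(word)["compound"] for word in words]
    return sum(scores) / len(scores)
```

Since most single words carry no sentiment in the lexicon, many of the per-word scores are expected to be zero, which pulls the average towards the neutral point.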

We analysed all the above characters with more than 20 lines of dialogue. There are 25 characters in total. Among these, we treat the Orcs collectively as a single character.

We found estimates of the ages of these characters on the internet (sources are listed in the references). These are not their lifespans but their ages at the time they appear in the films. It should be noted that, as the books and films are not explicit about the age of certain characters when they appear, part of the data involves an element of guesswork.

We have already cleaned the data, so there is no need to do it again here.

2.2 Sentiment analysis

This time we do the sentiment analysis word by word.

We found a slight difference between the results obtained from the sentence-by-sentence and word-by-word analyses. One possible reason is that the word-by-word analysis yields more zeros. More zeros pull the averages towards the neutral point, which means greater robustness and less sensitivity to outliers.

2.3 Normalisation of data.

We provide two methods of normalising the data here. The Z-score from the library standardises the data to zero mean and unit standard deviation, so that values above the mean are positive and values below it are negative; note that the result is not confined to a fixed interval such as [-1,1]. We have also written our own alternative normalisation method to rescale the data to the interval (0,1).

We will cross-use these two methods later in the plotting and data processing.
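Both normalisations can be written in a few lines. The sketch below uses only the standard library; the use of the population standard deviation and the min-max form of `normalized()` are assumptions about how the two methods are defined.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes values are not all equal


def normalized(values):
    """Min-max rescaling: smallest value maps to 0, largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo
```

For the values `[1, 2, 3]`, the z-scores are roughly -1.22, 0 and 1.22, which illustrates that standardised values can fall outside [-1, 1], while `normalized([2, 4, 6])` gives `[0.0, 0.5, 1.0]`.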

The figure below shows the distribution of our sentiment score data after processing through the Z-score.

2.4 The relationship between age and mood scores.

This double y-axis graph shows our data, and it is difficult to see a clear relationship between age and mood scores.

The data used here for both age and mood scores are unprocessed, as this makes it more obvious that there is no clear relationship between them and that the fluctuations in each are unpredictable.

2.5 Difference between age and mood score

In this graph we have used a bar chart structure that was used earlier in the article. We have arranged the data in order from largest to smallest in an attempt to summarise the pattern of the difference between age and emotional orientation.

We used the normalized() function to scale the age and score data to between (0,1) so that we could better compare their proportional relationship. Once again, we do not see any obvious pattern.

2.6 The relationship between age and emotional expression

Finally, we have represented visually the relationship between a character's tendency to express emotions and his or her age through a scatter chart containing the avatars of each character. We will be able to see that characters of different ages are more likely to express more negative, more positive or flatter emotions.

Here, we did not normalise the age data. We applied the Z-score normalisation to the sentiment scores, which stand for the tendency to express different emotions.

It is worth mentioning that we have changed the scale of the X-axis to "symlog". This symmetrical log scale allows both positive and negative values, and it also lets us set a range around zero within which the plot is linear instead of logarithmic. This makes it easier to see where most data points are located, even if there are some extreme outliers (e.g. Treebeard, who is over 7000 years old).
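The axis setup can be sketched with matplotlib as follows. The ages, scores and the `linthresh` value are placeholders chosen for illustration; the keyword `linthresh` is the form used by recent matplotlib versions (3.3 and later).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

ages = [33, 50, 87, 129, 2931, 7000]         # placeholder ages
scores = [0.4, -0.2, 0.1, 1.8, -0.9, 0.05]   # placeholder z-scored sentiment

fig, ax = plt.subplots()
ax.scatter(ages, scores)

# Linear within [-100, 100] on the x-axis, logarithmic beyond that range.
ax.set_xscale("symlog", linthresh=100)

# Red dashed reference line at neutral sentiment.
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Age (symlog scale)")
ax.set_ylabel("Sentiment (z-score)")
```

A larger `linthresh` keeps more of the younger characters in the linear region, so the value is worth tuning to the age range where most points lie.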

Characters located above the red dotted line have a largely positive emotional expression and are more likely to express positive emotional words. Characters below the red line, on the other hand, express more negative words. The closer to the red line, the weaker their tendency, and conversely, the further away from the red line, the stronger their tendency to choose a certain type of word.

Result

As we can see from the chart, most of the characters are located near the red line and their ages are within 180 years of each other.

We can observe that the older characters on the right-hand side of the picture are indeed closer to the red line, as we would expect, which means that their emotional expressions are generally more subtle and not clearly inclined in either direction.

Moreover, almost none of the characters older than 500 tend to express more positive words. Grima, Legolas and Elrond express the most negative sentiment of all the characters in the group. They are all older than 1000. Therefore, it seems as if older characters are more willing to express negative emotions in the world of Lord of the Rings.

At a glance we can see several anomalies. At over 7000 years old, Treebeard is the oldest of these characters. Bilbo, on the other hand, expresses his emotions very strongly. Although he is 129 years old, he is still the character who expresses the most positive and the strongest emotions among them.

What can be said is that younger characters express their emotions more strongly than older ones, in both negative and positive directions. As a group they have no clear emotional orientation: a significant number of them express relatively positive emotions, and a significant number express relatively negative ones.

Part C - Demographic Analysis of the Lord of the Rings

The procedure is the same here, except that we now use a data set containing the names of characters, their realm, the age/year in which they were born, when they died and what race they belong to. It is important to note that the data set includes characters mentioned in the entire Tolkien universe. It is not demographic data in the sense of a record of the entire population.

2.1 Cleaning Data

Here we have to deal specifically with missing data, recurring commas and other recording errors.

2.2 Race Data Cleaning and Manipulation

We remove all entries that have a missing value ("NaN") for race and unify spelling variants to combine categories.
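In pandas, this step could be sketched as below. The toy rows, the column names and the mapping of variants are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for the character data set; NOT the project's data.
characters = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "race": ["Men", "Man", None, "Elves ", "Elf"],
})

# Hypothetical mapping that merges spelling variants into one category.
RACE_MAP = {"Man": "Men", "Elf": "Elves"}

# Drop rows without a race, tidy whitespace, then unify the categories.
races = characters.dropna(subset=["race"]).copy()
races["race"] = races["race"].str.strip().replace(RACE_MAP)

# Number of mentioned characters per race.
race_counts = races["race"].value_counts()
```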

2.3 Visualising Race Distribution

The following code creates a doughnut chart for the race distribution among mentioned characters in the Tolkien Universe
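A doughnut chart is essentially a pie chart whose wedges have a width smaller than the radius. The sketch below shows one way to draw it with matplotlib; the race counts are placeholder numbers, not the distribution found in the data set.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Placeholder counts; NOT the project's data.
race_counts = {"Men": 40, "Elves": 25, "Hobbits": 20, "Dwarves": 15}

fig, ax = plt.subplots()
ax.pie(
    list(race_counts.values()),
    labels=list(race_counts.keys()),
    autopct="%1.0f%%",           # print the percentage on each wedge
    pctdistance=0.8,             # place the percentages inside the ring
    wedgeprops={"width": 0.4},   # width < 1 leaves a hole in the middle
)
ax.set_title("Race distribution (placeholder data)")
```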

2.4 Cleaning Realm Data

Cleaning data in such a way that only realms from the Third Age are still available in the data. For the visualisation we can feasibly use a map of the Third Age only.

2.5 Manipulating Realm Data

We manipulate the realm data to obtain the number of characters from each realm. Additionally, we add coordinates that fit the map image.
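The counting and the attachment of map coordinates could look like the following sketch. The rows and the pixel coordinates are invented placeholders, and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the cleaned realm column; NOT the project's data.
characters = pd.DataFrame({"realm": ["Gondor", "Gondor", "Rohan", "Shire", "Gondor"]})

# Hypothetical pixel coordinates of each realm on the map image.
coords = pd.DataFrame({
    "realm": ["Gondor", "Rohan", "Shire"],
    "x": [620, 540, 300],
    "y": [700, 560, 330],
})

# Count characters per realm, then attach the coordinates of each realm.
realm_counts = (
    characters.groupby("realm").size().reset_index(name="count")
    .merge(coords, on="realm", how="inner")
)
```

The resulting table has one row per realm, so its `x`, `y` and `count` columns can be passed directly to a scatterplot, with `count` scaled to the bubble size.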

2.6 Visualising Characters and their Realms by Bubbles on a Map of Middle Earth

We use a scatterplot with points/bubbles of variable size to showcase where in Middle Earth the characters mentioned in the Tolkien Universe are from.

Reference

Overview

Gilsdorf, Ethan. Lord of the Gold Ring. The Boston Globe. November 16, 2003 [2006-06-16].

Wagner, Vit. Tolkien proves he's still the king. The Star. 2007-04-16 [2011-04-24].

Part A

https://github.com/MokoSan/FSharpAdvent/blob/master/Data/Movies.csv

https://medium.com/@mukund.sharma92/the-lord-of-the-rings-an-f-approach-the-path-of-the-hobbits-f2b84cfab859

https://matplotlib.org/stable/gallery/lines_bars_and_markers/horizontal_barchart_distribution.html#sphx-glr-gallery-lines-bars-and-markers-horizontal-barchart-distribution-py

https://matplotlib.org/stable/gallery/specialty_plots/radar_chart.html

https://www.boxofficemojo.com/?ref_=bo_nb_gr_mojologo

https://www.imdb.com/

https://www.metacritic.com/

https://www.rottentomatoes.com/

Part B

Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

https://github.com/tianyigu/Lord_of_the_ring_project

https://fictionhorizon.com/how-old-are-the-characters-in-lord-of-the-rings-from-youngest-to-oldest-character-age/

https://lotr.fandom.com/wiki

https://screenrant.com/lord-rings-fellowship-characters-ages-how-old/

https://www.statista.com/statistics/384102/age-of-selected-characters-in-lord-of-the-rings/

https://imgur.com/355G0pm

Part C

[OC] Topographic Map of Middle Earth - Updated! @t_dolstra https://www.reddit.com/r/lotr/comments/h7ecs7/oc_topographic_map_of_middle_earth_updated/